1 Session S6.2: compilers and program analysis: Experience with a retargetable
compiler for a commercial network processor



Jinhwan Kim, Sungjoon Jung, Yunheung Paek, Gang-Ryung Uh 
October 2002 Proceedings of the 2002 international conference on Compilers, 
architecture, and synthesis for embedded systems 

Additional Information: full citation, abstract, references, citings, index terms

Full text available: pdf(275.87 KB)



The Paion PPII network processor is designed to meet the growing need for new high 
bandwidth network equipment. In order to rapidly reconfigure the processor for frequently 
varying internet services and technologies, a high performance compiler is urgently needed. 
Although various code generation techniques have been proposed for DSPs or ASIPs, we
found that these techniques are not easily tailored to the target Paion PPII
processor due to striking architectural differences. First, we will s ... 

Keywords: compiler, network processor, non-orthogonal architecture 



2 Specifying representations of machine instructions 
Norman Ramsey, Mary F. Fernandez 



May 1997 ACM Transactions on Programming Languages and Systems (TOPLAS),
Volume 19 Issue 3

Additional Information: full citation, abstract, references, citings, index terms






We present SLED, a specification language for Encoding and Decoding, which describes
abstract, binary, and assembly-language representations of machine instructions. Guided
by a SLED specification, the New Jersey Machine-Code Toolkit generates bit-manipulating 
code for use in applications that process machine code. Programmers can write such 
applications at an assembly language level of abstraction, and the toolkit enables the 
applications to recognize and emit the binary representations u ... 

Keywords: compiler generation, decoding, encoding, machine code, machine description, 
object code, relocation 
November 2002 Proceedings of the 35th annual ACM/IEEE international symposium on
Microarchitecture 

Full text available: pdf(1.34 MB), Publisher Site Additional Information: full citation, abstract, references, citings, index terms

Multimedia processing on embedded devices requires an architecture that leads to high 
performance, low power consumption, reduced design complexity, and small code size. In 
this paper, we use EEMBC, an industrial benchmark suite, to compare the VIRAM vector 
architecture to superscalar and VLIW processors for embedded multimedia applications. The 
comparison covers the VIRAM instruction set, vectorizing compiler, and the prototype chip 
that integrates a vector processor with DRAM main memory. We de ... 

Fast compilation for pipelined reconfigurable fabrics
Mihai Budiu, Seth Copen Goldstein 

February 1999 Proceedings of the 1999 ACM/SIGDA seventh international symposium 
on Field programmable gate arrays 

Full text available: pdf(2.03 MB) Additional Information: full citation, references, citings, index terms



VISA: A variable instruction set architecture 
Alessandro De Gloria 

May 1990 ACM SIGARCH Computer Architecture News, Volume 18 issue 2 
Full text available: pdf(468.19 KB) Additional Information: full citation, abstract, index terms

This paper presents an instruction coding technique that reduces the instruction
width without limiting the exploitation of the machine resources. This result is obtained
by dynamic instruction coding managed by the compiler. The technique has been applied
to a VLIW-structured machine. The machine has an instruction width of 32 bits, while the 
number of bits needed to code all the functions the machine can perform in a cycle is 98. 

Initial results on the performance and cost of vector microprocessors 
Corinna G. Lee, Derek J. DeVries 

December 1997 Proceedings of the 30th annual ACM/IEEE international symposium on
Microarchitecture

Full text available: pdf(1.73 MB), Publisher Site Additional Information: full citation, abstract, references, citings, index terms

Increasingly wider superscalar processors are experiencing diminishing performance returns 
while requiring larger portions of die area dedicated to control rather than datapath. As an 
alternative to using these processors to exploit parallelism effectively, we are investigating 
the viability of using single-chip vector microprocessors. This paper presents some initial 
results of our investigation where we compare the performance and cost of vector 
microprocessors to that of aggressive, out-of-or ... 

APRIL: a processor architecture for multiprocessing 
Anant Agarwal, Beng-Hong Lim, David Kranz, John Kubiatowicz 

May 1990 ACM SIGARCH Computer Architecture News, Proceedings of the 17th
annual international symposium on Computer Architecture, volume 18 issue 3
Full text available: pdf(1.38 MB) Additional Information: full citation, abstract, references, citings, index terms

Processors in large-scale multiprocessors must be able to tolerate large communication 
latencies and synchronization delays. This paper describes the architecture of a rapid- 
context-switching processor called APRIL with support for fine-grain threads and 
synchronization. APRIL achieves high single-thread performance and supports virtual 
dynamic threads. A commercial RISC-based implementation of APRIL and a run-time
software system that can switch contexts in about 10 cycles are described. Me ...

8 LISP on a reduced-instruction-set-processor
Peter Steenkiste, John Hennessy 

August 1986 Proceedings of the 1986 ACM conference on LISP and functional 
programming 

Full text available: pdf(953.13 KB) Additional Information: full citation, references, citings



9 Value-based clock gating and operation packing: dynamic strategies for improving
processor power and performance
David Brooks, Margaret Martonosi 

May 2000 ACM Transactions on Computer Systems (TOCS), Volume 18 Issue 2

Full text available: pdf(210.51 KB) Additional Information: full citation, abstract, references, citings, index terms

The large address space needs of many current applications have pushed processor designs 
toward 64-bit word widths. Although full 64-bit addresses and operations are indeed 
sometimes needed, arithmetic operations on much smaller quantities are still more 
common. In fact, another instruction set trend has been the introduction of instructions 
geared toward subword operations on 16-bit quantities. For example, most major
processors now include instruction set support for multimedia operation ... 



10 Optimal integrated code generation for clustered VLIW architectures
Christoph Kessler, Andrzej Bednarski 

June 2002 ACM SIGPLAN Notices, Proceedings of the joint conference on Languages,
compilers and tools for embedded systems: software and compilers for
embedded systems, volume 37 issue 7

Full text available: pdf(227.07 KB) Additional Information: full citation, abstract, references, index terms

In contrast to standard compilers, generating code for DSPs can afford spending 
considerable resources in time and space on optimizations. Generating efficient code for 
irregular architectures requires an integrated method that optimizes simultaneously for 
instruction selection, instruction scheduling, and register allocation. We describe a method 
for fully integrated optimal code generation based on dynamic programming. We introduce 
the concept of residence classes and space profiles 

Keywords: dynamic programming, instruction scheduling, instruction selection, integrated 
code generation, register allocation, space profile 



11 Computation techniques for FPGAs: An FPGA-based VLIW processor with custom 
hardware execution 

Alex K. Jones, Raymond Hoare, Dara Kusic, Joshua Fazekas, John Foster 
February 2005 Proceedings of the 2005 ACM/SIGDA 13th international symposium on 
Field-programmable gate arrays 

Full text available: pdf(220.52 KB) Additional Information: full citation, abstract, references, index terms

The capability and heterogeneity of new FPGA (Field Programmable Gate Array) devices
continue to increase with each new line of devices. Efficiently programming these devices
is increasing in difficulty. However, FPGAs continue to be utilized for algorithms traditionally
targeted to embedded DSP microprocessors such as signal and image processing
applications. This paper presents an architecture that combines VLIW (Very Large
Instruction Word) processing with the capability to introduce applicat ...

Keywords: NIOS, VLIW, compiler, kernels, parallelism, synthesis 
12 Multimedia and graphics: Enhancing loop buffering of media and telecommunications
applications using low-overhead predication
John W. Sias, Hillery C. Hunter, Wen-mei W. Hwu 

December 2001 Proceedings of the 34th annual ACM/IEEE international symposium on
Microarchitecture

Full text available: pdf(1.37 MB), Publisher Site Additional Information: full citation, abstract, references, citings

Media- and telecommunications-focused processors, increasingly designed as deeply 
pipelined, statically-scheduled VLIWs, rely on loop buffers for low-overhead execution of 
simple loops. Key loops containing control flow pose a substantial problem: full predication
has a high encoding overhead, and partial predication techniques do not support if- 
conversion, the transformation of general acyclic control flow into predicated blocks. Using 
a set of significant media processing benchmarks, drawn fr ... 

13 Static single assignment form for machine code 
Allen Leung, Lal George

May 1999 ACM SIGPLAN Notices, Proceedings of the ACM SIGPLAN 1999 conference
on Programming language design and implementation, volume 34 issue 5

Full text available: pdf(1.31 MB) Additional Information: full citation, abstract, references, citings, index terms

Static Single Assignment (SSA) is an effective intermediate representation in optimizing 
compilers. However, traditional SSA form and optimizations are not applicable to programs 
represented as native machine instructions because the use of dedicated registers imposed 
by calling conventions, the runtime system, and target architecture must be made explicit. 
We present a simple scheme for converting between programs in machine code and in SSA, 
such that references to dedicated physical registers ... 

14 Experience with fine-grain synchronization in MIMD machines for preconditioned
conjugate gradient

Donald Yeung, Anant Agarwal 

July 1993 ACM SIGPLAN Notices, Proceedings of the fourth ACM SIGPLAN symposium
on Principles and practice of parallel programming, volume 28 issue 7
Full text available: pdf(1.29 MB) Additional Information: full citation, abstract, references, citings, index terms

This paper discusses our experience with fine-grain synchronization for a variant of the 
preconditioned conjugate gradient method. This algorithm represents a large class of 
algorithms that have been widely used but have traditionally been difficult to implement efficiently on
vector and parallel machines. Through a series of experiments conducted using a simulator 
of a distributed shared-memory multiprocessor, this paper addresses two major questions 
related to fine-grain synchronization in the cont ... 

15 The KScalar simulator 

J. C. Moure, Dolores I. Rexachs, Emilio Luque 

March 2002 Journal on Educational Resources in Computing (JERIC), volume 2 issue 1
Full text available: pdf(493.35 KB) Additional Information: full citation, abstract, references, index terms

Modern processors increase their performance with complex microarchitectural 
mechanisms, which makes them more and more difficult to understand and evaluate. 
KScalar is a graphical simulation tool that facilitates the study of such processors. It allows 
students to analyze the performance behavior of a wide range of processor 
microarchitectures: from a very simple in-order, scalar pipeline, to a detailed out-of-order, 
superscalar pipeline with non-blocking caches, speculative execution, and comp ...



Keywords: Education, pipelined processor simulator 



16 High-cost CFD on a low-cost cluster 

Thomas Hauser, Timothy I. Mattox, Raymond P. LeBeau, Henry G. Dietz, P. George Huang 
November 2000 Proceedings of the 2000 ACM/IEEE conference on Supercomputing 
(CDROM) 

Full text available: pdf(4.00 MB), Publisher Site Additional Information: full citation, abstract, references, index terms

Direct numerical simulation of the Navier-Stokes equations (DNS) is an important technique 
for the future of computational fluid dynamics (CFD) in engineering applications. However, 
DNS requires massive computing resources. This paper presents a new approach for 
implementing high-cost DNS CFD using low-cost cluster hardware. After describing the DNS 
CFD code DNSTool, the paper focuses on the techniques and tools that we have developed 
to customize the performance of a cluster ... 

17 INSIDE: INstruction Selection/Identification & Design Exploration for Extensible 
Processors 

Newton Cheung, Sri Parameswaran, Jörg Henkel

November 2003 Proceedings of the 2003 IEEE/ACM international conference on 
Computer-aided design 

Full text available: pdf(248.23 KB) Additional Information: full citation, abstract, index terms

This paper presents the INSIDE system that rapidly searches the design space for extensible
processors, given area and performance constraints of an embedded application, while
minimizing the design turn-around-time. Our system consists of a) a methodology to
determine which code segments are most suited for implementation as a set of extensible
instructions, b) a heuristic algorithm to select pre-configured extensible processors as well as
extensible instructions (library), and c) an estimation tool ...

18 HW/SW co-design: Instruction set and functional unit synthesis for SIMD processor
cores 

Nozomu Togawa, Koichi Tachikake, Yuichiro Miyaoka, Masao Yanagisawa, Tatsuo Ohtsuki 
January 2004 

Full text available: pdf(161.66 KB), Publisher Site Additional Information: full citation, abstract, references

This paper focuses on SIMD processor synthesis and proposes a SIMD instruction 
set/functional unit synthesis algorithm. Given an initial assembly code and a timing 
constraint, the proposed algorithm synthesizes an area-optimized processor core with 
optimal SIMD functional units. It also synthesizes a SIMD instruction set. The input initial 
assembly code is assumed to run on a full-resource SIMD processor (virtual processor) 
which has all the possible SIMD functional units. In our algorithm, we i ... 

19 MSIM: an improved microcode simulator 
Simeon Simeonov, G. Michael Schneider 
June 1995 ACM SIGCSE Bulletin, volume 27 issue 2 

Full text available: pdf(515.55 KB) Additional Information: full citation, index terms
iWarp: an integrated solution of high-speed parallel computing
S. Borkar, R. Cohn, G. Cox, S. Gleason, T. Gross

November 1988 Proceedings of the 1988 ACM/IEEE conference on Supercomputing 

Full text available: pdf(1.35 MB) Additional Information: full citation, abstract, references, citings, index terms

iWarp is a system architecture for high speed signal, image and scientific computing. The 
heart of an iWarp system is the iWarp component: a single chip processor that requires 
only the addition of memory chips to form a complete system building block, called the 
iWarp cell. Each iWarp component contains both a powerful computation engine (20 
MFLOPS) and a high throughput (320 MBytes/sec), low latency (100-150 ns) 
communication engine for interfacing with other iWarp cells. Because of its s ... 